Traffic in a Shared Memory
Parallel Computer*

Susan Dickey, Allan Gottlieb, Richard Kenner,
and Yue-Sheng Liu


Abstract.  Serialization  of  memory  access  can  be  a  critical  bottleneck  in  shared 
memory parallel computers. The NYU Ultracomputer, a large-scale MIMD (multiple
instruction stream, multiple data stream) shared memory architecture, may be viewed
as  a  column  of  processors  and  a  column  of  memory  modules  connected  by  a 
rectangular  network  of  enhanced  2x2  buffered  crossbars.  These  VLSI  nodes  enable 
the  network  to  combine  multiple  requests  directed  at  the  same  memory  location. 
Such  requests  include  a  new  coordination  primitive,  fetch-and-add,  which  permits 
task  coordination  to  be  achieved  in  a  highly  parallel  manner.  Processing  within  the 
network  is  used  to  reduce  serialization  at  the  memory  modules. 

To avoid large network latency, the VLSI network nodes must be high-performance
components. Design tradeoffs between architectural features, asymptotic
performance requirements, cycle time, and packaging limitations are complex. This
report sketches the Ultracomputer architecture and discusses the issues involved in
the  design  of  the  VLSI  enhanced  buffered  crossbars  which  are  the  key  element  in 
reducing  serialization. 

1.  Introduction 

Highly  parallel  computers  composed  of  thousands  of  processors  and 
gigabytes of memory have the potential to solve problems with vast
computational demands. With 10-20 MIPS and a megabyte of memory soon to be
available on a few chips, such highly parallel machines can be built with
roughly the same component count as the current generation of
supercomputers.

Effective  utilization  of  such  a  high-performance  assemblage  demands  an 
integrated  hardware/software  approach.  Several  thousand  processors  must 
be coordinated in such a way that their aggregate power is applied to useful
computation.  Classic  techniques  for  interprocessor  coordination  use 
"critical  sections" — serial  procedures  in  which  one  processor  accesses  a 
shared  resource  while  the  others  wait.  In  any  highly  parallel  architecture, 
such  an  approach  is  inadequate,  since  the  cost  of  critical  sections  rises 
linearly with the number of processors. Serial procedures become
bottlenecks that drastically reduce the power obtained and thus must be
eliminated. 

Our  group  has  proposed  [7]  that  the  hardware  and  software  design  of 
a  highly  parallel  computer  should  meet  the  following  goals. 

•  Scaling.  Effective  performance  should  scale  upward  to  a  very  high 
level.  Given  a  problem  of  sufficient  size,  an  n-fold  increase  in  the 
number  of  processors  should  yield  a  speedup  factor  of  almost  n. 

•  General  purpose.  The  machine  should  be  capable  of  efficient  execution 
of  a  wide  class  of  algorithms,  displaying  relative  neutrality  with  respect 
to  algorithmic  structure  or  data  flow  pattern. 

•  Programmability. High-level programmers should not have to consider
the  machine's  low-level  structural  details  in  order  to  write  efficient 
programs.  Programming  and  debugging  should  not  be  substantially 
more  difficult  than  on  a  serial  machine. 

•  Multiprogramming.  The  software  should  be  able  to  allocate  processors 
and  other  machine  resources  to  different  phases  of  one  job  and/or  to 
different  user  jobs  in  an  efficient  and  highly  dynamic  way. 

Section  2  reviews  the  MIMD  shared  memory  computational  model  on 
which  the  Ultracomputer  is  based  and  outlines  a  hardware  design  closely 
approximating this model. Section 3 discusses selected issues in the design
of the network using custom VLSI switches. We conclude with a brief report
of the current status and future plans of our VLSI effort.

A series of technical reports, referred to as "Ultracomputer Notes," has
been prepared by researchers. Readers wishing more information should
write to Michael Passaro at 251 Mercer Street, New York, New York 10012,
USA (passaro@nyu.arpa).

2.  Ultracomputer  architecture 

In  this  section  we  review  the  parallel  computation  model  on  which  the 
Ultracomputer is based. Although this idealized machine is not physically
realizable,  we  show  that  a  close  approximation  can  be  built. 

2.1.   The  model 

An  idealized  parallel  processor,  dubbed  a  "paracomputer"  by  Schwartz 
[23]  and  classified  as  a  WRAM  by  Borodin  and  Hopcroft  [1],  consists  of 
a number of autonomous processing elements (PEs) sharing a central
memory  (see  also  Fortune  and  Wyllie  [5]  and  Snir  [24]).  Every  PE  is 
permitted  to  read  or  write  a  shared  memory  cell  each  cycle.  In  particular, 
simultaneous  reads  and  writes  directed  at  the  same  memory  cell  are  all 
accomplished  in  a  single  cycle. 

The  serialization  principle  (see  Eswaran  et  al.  [4])  states  that  the  effect 
of  simultaneous  actions  by  the  PEs  is  as  if  the  actions  had  occurred  in  some 
(unspecified)  serial  order.  Thus,  for  example,  a  load  simultaneous  with  two 
stores  directed  at  the  same  memory  cell  will  return  either  the  original  value 
or  one  of  the  two  stored  values.  The  returned  value  may  be  different  from 
the  value  that  the  cell  finally  comes  to  contain.  In  this  model,  simultaneous 
memory  updates  are  in  fact  accomplished  in  one  cycle;  the  serialization 
principle  makes  precise  the  effect  of  simultaneous  access  to  shared  memory 
but  does  not  prescribe  its  implementation. 
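
As a small illustration of the serialization principle, the following Python sketch enumerates every serial order of one load and two stores directed at the same cell, recording the value the load returns together with the final cell contents. The function name `possible_outcomes` and the concrete values are illustrative assumptions, not part of the paracomputer model itself.

```python
from itertools import permutations


def possible_outcomes(initial, stores):
    """Enumerate (value returned by the load, final cell value) pairs.

    One load and len(stores) stores are directed at the same cell; every
    serial order of these operations is tried in turn.
    """
    ops = [("load", None)] + [("store", v) for v in stores]
    outcomes = set()
    for order in permutations(ops):
        cell = initial
        loaded = None
        for kind, value in order:
            if kind == "load":
                loaded = cell      # the load sees whatever is in the cell now
            else:
                cell = value       # a store overwrites the cell
        outcomes.add((loaded, cell))
    return outcomes


if __name__ == "__main__":
    for loaded, final in sorted(possible_outcomes(0, [1, 2])):
        print(f"load returns {loaded}, cell finally holds {final}")
```

With an initial value of 0 and stores of 1 and 2, the load can return 0, 1, or 2, and outcomes such as (1, 2) show the returned value differing from the value the cell finally contains.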

In  an  actual  hardware  implementation,  single  cycle  access  to  globally 
shared memory cannot be achieved. For any technology there is a limit,
say k, on the number of signals that one can fan in at once. Thus, if N
processors are to access even a single bit of shared memory, the shortest
access time possible is log_k N. Hardware achieving this logarithmic access
time,  even  when  many  processors  simultaneously  access  a  single  cell,  has 
been  designed,  but  does  not  use  off  the  shelf  components.  A  custom  VLSI 
design  is  needed  for  the  switching  components  used  in  the  processor  to 
memory  interconnection  network. 

This  network  adds  significantly  to  the  size  of  the  machine  and  to  its 
replication costs. For N processors and N memory modules, N log N
switching  components  are  required.  This  results  in  an  inherently  lower  peak 
performance  than  that  of  a  design  of  equivalent  size  in  which  the  processors 
themselves  act  as  switching  components  without  globally  shared  memory. 
For  any  metric  (dollars,  cubic  feet,  BTUs,  etc.)  the  shared  memory  design 
with  a  connection  network  will  contain  fewer  processors  or  memory  cells 
than  a  private  memory  design  with  only  wires  connecting  the  processors. 
We  believe  that  the  increased  flexibility  and  generality  of  shared  memory 
designs  adequately  compensates  for  their  lower  peak  performance,  but  this 
issue  has  not  been  settled.  Most  likely  the  answer  will  prove  to  be  so 
application  dependent  that  both  shared  and  private  memory  designs  will 
prove  successful. 

2.2.    The  fetch-and-add  operation 

We augment the shared memory model described above with the "fetch-and-add"
(F&A) operation. This operation is an indivisible add to memory; its
format is F&A(X, e), where X is an integer variable and e is an integer
expression. The operation is defined to return the (old) value of X and to
replace X by the sum X + e.

Concurrent fetch-and-adds are required to satisfy the serialization
principle. Fetch-and-add operations simultaneously directed at X cause it to
be modified by the sum of the increments. Each operation yields the
intermediate value of X corresponding to some order of execution. The
following example illustrates a typical use of fetch-and-add: Consider
several PEs concurrently executing F&A(I, 1), where I is a shared variable
used as an index into a shared array. Each PE obtains an index to a distinct
array element (although one cannot predict which element will be assigned
to which PE), and I is incremented by the number of PEs executing the F&A.
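
The index-claiming example can be imitated in software. The sketch below is only a model under stated assumptions: Python threads stand in for PEs, and a lock stands in for the indivisibility that the hardware provides (the lock is a simulation device, not how the Ultracomputer implements the operation). The names `SharedCell` and `claim_indices` are hypothetical.

```python
import threading


class SharedCell:
    """A shared integer supporting an indivisible fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        # The lock only models atomicity; it is not part of the paper's design.
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        """Return the old value and replace it by old + e, indivisibly."""
        with self._lock:
            old = self._value
            self._value = old + e
            return old

    def read(self):
        with self._lock:
            return self._value


def claim_indices(num_pes):
    """Let num_pes simulated PEs each execute F&A(I, 1) once."""
    index = SharedCell(0)
    claimed = [None] * num_pes

    def pe(pe_id):
        claimed[pe_id] = index.fetch_and_add(1)

    threads = [threading.Thread(target=pe, args=(i,)) for i in range(num_pes)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return claimed, index.read()


if __name__ == "__main__":
    claimed, final = claim_indices(8)
    print("indices claimed per PE:", claimed)
    print("final value of I:", final)
```

Which PE receives which index varies from run to run, but the claimed indices are always distinct and the final value of I equals the number of simulated PEs.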

Fetch-and-add  is  a  powerful  interprocessor  synchronization  operation 
that  permits  highly  concurrent  execution  of  both  operating  system  primitives 
and application programs (see Gottlieb and Kruskal [9]). Using the
fetch-and-add operation, we can perform many important algorithms in a
completely parallel manner, i.e., without using any critical sections. For example,
as indicated above, concurrent executions of F&A(I, 1) yield consecutive
values that may be used to index an array. If this array is interpreted as a
(sequentially stored) queue, the values returned may be used to perform
concurrent inserts; analogously F&A(D, 1) may be used for concurrent
deletes. The complete queue algorithms contain checks for overflow and
underflow, collisions between insert and delete pointers, etc. (see Gottlieb
et al. [10]). We are unaware of any other completely parallel solutions to
this  problem.  To  illustrate  the  nonserial  behavior  obtained,  we  note  that 
given  a  single  queue  that  is  neither  empty  nor  full,  the  concurrent  execution 
of  thousands  of  inserts  and  thousands  of  deletes  can  all  be  accomplished 
in  the  time  required  for  just  one  such  operation. 


2.3.   Hardware  realization 

The direct single cycle access to shared memory characteristic of
paracomputers is approximated in the NYU Ultracomputer by indirect access via
a multicycle connection network. A message switching network with the
topology of Lawrie's [18] Ω-network (equivalently, the SW Banyan of Goke
and Lipovski [6]) is used to connect N = 2^D autonomous PEs to a central
shared  memory  composed  of  N  memory  modules  (MMs).  Figure  1  gives 
a  block  diagram  of  the  machine. 

The  Ultracomputer  design  places  few  constraints  on  the  processors  and 
memory  modules.  Naturally,  the  fetch-and-add  instruction  is  needed.  In 
addition,  the  presence  of  a  memory  which  has,  when  viewed  through  the 
network, a high bandwidth and nonnegligible latency strongly favors
processors that permit prefetching of instructions and operands. Furthermore,
memory should be interleaved to prevent clustering of references to a given
module, and should be augmented with an adder to perform the fetch-and-add
operation atomically.
Figure 1. Block diagram of the NYU Ultracomputer.


Implementing  a  cache  at  each  PE  permits  rapid  access  to  frequently  used 
instructions  and  data  and  reduces  network  traffic.  Unfortunately,  caching 
of  read-write  shared  variables  presents  a  coherence  problem  among  the 
various caches. An obvious method of eliminating this problem is to simply
not cache read-write shared variables and have the software distinguish
between shared and private variables, typically by grouping them into
separate memory-management segments. A more elaborate scheme is based
on the observation that if, during a particular code segment, a shared
variable is accessed read-only, or accessed only privately, then this variable
may be cached during execution of that segment (McAuliffe [20]).

The enhanced Ω-network connecting processors to memory that we
proposed for the Ultracomputer achieves the following objectives:

•  Bandwidth  linear  in  N,  the  number  of  PEs. 

•  Latency  (memory  access  time)  logarithmic  in  N. 

•  Only O(N log N) identical components.

•  Routing  decisions  local  to  each  switch;  thus  routing  is  not  a  serial 
bottleneck  and  is  efficient  for  short  messages. 


Details  of  the  network  design  are  given  in  the  following  section  and  in 
Gottlieb [7]. See Figure 2 for a diagram of an Ω-network. Note that there
exists  a  unique  path  connecting  each  PE-MM  pair;  the  method  for  using 
such  a  network  to  implement  memory  loads  and  stores  is  well  known. 

Although issued by the processor, fetch-and-add operations are effected
in the MMs. Since memory modules operate sequentially, only one request
may be satisfied in each cycle. If concurrent fetch-and-add or load operations
were to be serialized at the memory of a real parallel computer, we would
lose the advantage of parallel coordination algorithms, having merely moved
the critical sections from the software to the hardware. Instead, memory
requests to identical locations are combined when they meet at a switch
(see Klappholz [13] and Gottlieb et al. [10]).

Figure 2. An 8 PE omega network.

Combining  (merging)  requests  reduces  communication  traffic  and  thus 
decreases  the  length  of  the  queues  within  the  switches,  leading  to  lower 
network  latency  (i.e.,  reduced  memory  access  time).  Furthermore,  it  can 
help  to  prevent  saturation  of  the  network  when  "hotspot"  traffic  directs 
many requests to a single location (Pfister and Norton [22] and Lee et al.
[19]). 

Enhanced switches permit the network to combine fetch-and-adds with
the same efficiency as it combines loads and stores. When two fetch-and-adds
referencing the same shared variable, say F&A(X, e) and F&A(X, f), meet
at a switch, the switch forms the sum e + f, transmits the combined request
F&A(X, e + f), and stores the value e in its local memory. When the value
Y is returned to the switch in response to F&A(X, e + f), the switch transmits
Y to satisfy one request (F&A(X, e)) and transmits Y + e to satisfy the
other request (F&A(X, f)). Assuming that the combined request was not
further combined with yet another request, memory location X is properly
incremented, becoming X + e + f. If other fetch-and-add operations
updating X are encountered, the combined requests are themselves combined.
The associativity of addition guarantees that this procedure gives a result
consistent with the serialization principle. Note that other associative
operations can be combined in a similar manner.
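
The combining and de-combining steps can be traced with a short Python sketch. It is a functional model only, assuming a dictionary-backed memory module and ignoring timing, queues, and routing. All names (`MemoryModule`, `combine_pair`, `decombine`, `combined_fetch_and_add`) are hypothetical, and the balanced pairwise tree in `decombine` is just one way requests might happen to meet.

```python
class MemoryModule:
    """A memory module with an adder, serving one request at a time."""

    def __init__(self):
        self.cells = {}
        self.requests_served = 0

    def fetch_and_add(self, addr, e):
        old = self.cells.get(addr, 0)
        self.cells[addr] = old + e
        self.requests_served += 1
        return old


def combine_pair(memory, addr, e, f):
    """One switch combines F&A(addr, e) with F&A(addr, f)."""
    saved = e                                  # kept in the switch's local memory
    y = memory.fetch_and_add(addr, e + f)      # single combined request
    return y, y + saved                        # replies to the two requesters


def decombine(y, increments):
    """Split the reply y among requests that were combined pairwise."""
    if len(increments) == 1:
        return [y]
    mid = len(increments) // 2
    left, right = increments[:mid], increments[mid:]
    # The left group is served as if first; the right group sees y + sum(left).
    return decombine(y, left) + decombine(y + sum(left), right)


def combined_fetch_and_add(memory, addr, increments):
    """Combine all increments into one memory request, then de-combine."""
    if not increments:
        raise ValueError("at least one request is required")
    y = memory.fetch_and_add(addr, sum(increments))
    return decombine(y, increments)


if __name__ == "__main__":
    mem = MemoryModule()
    mem.cells["X"] = 10
    print(combine_pair(mem, "X", 3, 4), mem.cells["X"], mem.requests_served)
```

In this model the replies match a serial execution in which the requests are applied left to right, while the memory module serves only one request.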
3.  Network design

Designing a network having the above functionality and providing adequate
performance for a given configuration involves many tradeoffs. The
asymptotically best organization may not necessarily attain the best
performance when constructed in a given VLSI technology. The following four
subsections  discuss  in  turn  message  transmission  protocols  and  the  effect 
of  packaging  limitations,  network  performance  analysis,  the  structure  of 
buffers  internal  to  the  switch,  and  the  cost  of  combining. 


3.1.   Message  transmission  and  packaging  limitations 

Bandwidth proportional to the number of PEs has been achieved by
providing queues within each switch. These allow concurrent processing of requests
routed to the same port whenever possible. The alternative adopted by
Burroughs (Hauck and Dent [12]) of killing one of the two conflicting
requests limits bandwidth to O(N/log N); see Kruskal and Snir [14].

In  addition,  the  network  is  pipelined.  Paths  through  the  network  are  not 
maintained  while  awaiting  memory  responses.  Thus  the  rate  at  which  the 
processor  may  issue  memory  requests  is  limited  by  the  switch  cycle  time 
rather  than  the  network  transit  time. 

The  delay  inherent  in  off-chip  communication  between  VLSI  switching 
nodes  is  likely  to  be  a  greater  constraint  on  network  performance  than  the 
rate  at  which  information  can  be  processed  within  each  node.  Significant 
additional  logic  can  be  included  in  each  node  with  advantage  when  that 
logic would help avoid global signaling or reduce queuing delays by
combining messages. On the other hand, for a lightly loaded network the basic
cycle speed of the switch is the most important factor in determining the
latency of the network. If processing within a node is overlapped with data
transfer between nodes, an increase in internal complexity may be tolerated
without lengthening the cycle. The degree to which this can be done depends
on  the  current  state  of  technology. 

The  number  of  chips  required  to  implement  each  switching  node  appears 
likely  to  be  determined  by  the  pin  count  required  at  each  node,  rather  than 
the  silicon  area  of  the  switching  logic.  Logic  density  per  chip  has  been 
increasing  much  faster  than  pin  density  per  chip.  The  number  of  pins 
available  per  chip  is  still  too  small  to  construct  even  a  2x2  bidirectional 
switch  which  processes  single  packet  messages  (address,  control,  and  data) 
for  a  32-bit  machine.  Therefore,  messages  must  be  split  into  multiple  packets 
and  one  of  two  methods  can  be  used  to  transmit  these  packets  through  the 
network. The first is a bit-sliced implementation in which different
components are handling different packets of one message (transmission of
messages is "space-multiplexed"). The second is to time-multiplex the
transmission of successive packets of a message to the same component.

Space-multiplexing provides a higher bandwidth than time-multiplexing
at the expense of more components. However, a large amount of
"horizontal" communication and coordination must then take place between the
different components of a switch. The communication between components
required to match addresses and add data when combining is especially
costly in both the complexity of such an implementation and the switch cycle
time. For MOS technologies, the off-chip delays impose an especially high
overhead. Furthermore, module packaging limitations are such that, even
if several chips are used to implement a single node, some degree of
time-multiplexing may be required in order to package a network node on a
given module.

Several  cycles  are  required  to  transmit  each  message  if  time-multiplexing 
is  used.  However,  the  internal  logic  of  the  switch  can  be  pipelined  so  that 
messages  can  be  handled  on  a  per  packet  basis  and  do  not  have  to  be 
assembled  at  each  switch.  Thus,  there  can  be  as  little  as  a  one  cycle  delay 
per  switch  when  queues  are  empty  and  hence  time-multiplexing  contributes 
an  additive  term  to  the  latency  rather  than  a  multiplicative  factor.  However, 
queuing  delays  increase  quadratically  with  the  multiplexing  factor,  so  that 
the  performance  of  the  network  under  heavy  load  may  be  seriously  impaired 
(Kruskal et al. [15]).

For  any  given  state  of  technology,  packaging  constraints  do  not  determine 
a  unique  design  for  the  network.  For  the  same  number  of  pins  per  chip,  it 
is possible to replace 2x2 switches by k x k switches, time multiplexing
each  pin  by  a  factor  of  k/2.  Dividing  a  message  into  more  packets  may  or 
may  not  involve  an  increase  in  cycle  time,  depending  on  the  nature  of  the 
detailed  VLSI  design.  Breaking  up  parts  of  the  address  or  the  data  into 
different  packets  will  increase  the  internal  complexity  required  to  handle 
matching  and  addition,  but  fewer  bits  per  packet  will  shorten  certain  global 
control lines. The increased logic to perform k x k switching rather than
2x2  switching  is  likely  to  increase  cycle  time,  but  possibly  not  by  a  significant 
amount.  In  the  following  subsection,  we  present  performance  analyses  of 
various  networks  in  order  to  indicate  the  tradeoffs  involved. 

3.2.   Network  performance  analysis 

A  particular  configuration  is  characterized  by  the  values  of  the  following 
parameters: 

k      The size of the switch. Recall that a k x k switch requires 4k ports.
m      The time multiplexing factor, i.e., the number of switch cycles
       required to input a message. (To simplify the analysis we assume
       that all the messages have the same length.)
t      The switch cycle time.

Note that for any k a network with n inputs and n outputs can be built
from (n log_k n)/k k x k switches and a proportional number of wires, where
n is a power of k. (If n is not a power of k, some redundancy
or unused connections may be required to use identical parts.) Since our
network contains a large number of identical switches, the network's cost
may be considered proportional to the number of switches.
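
The switch count is easy to check numerically. The sketch below assumes the count is (n/k) log_k n, i.e., n/k switches in each of log_k n stages, which reproduces the 64-PE component counts quoted later in this section (192, 48, and 16). The function names `stage_count` and `switch_count` are illustrative.

```python
def stage_count(n, k):
    """Return log_k n using integer arithmetic; n must be a power of k."""
    if k < 2 or n < k:
        raise ValueError("need k >= 2 and n >= k")
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return stages


def switch_count(n, k):
    """Number of k x k switches in an n-input, n-output network."""
    return (n // k) * stage_count(n, k)


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(f"64 PEs, {k}x{k} switches: {switch_count(64, k)}")
```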

In  order  to  obtain  a  tractable  mathematical  model  of  the  network  the 
following  simplifying  assumptions  are  made  for  the  remainder  of  this 
section: 

•  Requests  are  not  combined. 

•  Messages  have  the  same  length. 

•  Queues  are  of  infinite  size. 

•  Requests  are  generated  at  each  PE  by  independent,  identically  dis- 
tributed, time-invariant  random  processes. 

•  MMs  are  equally  likely  to  be  referenced. 

Let p be the average number of messages entered into the network by each
PE per time unit. If the queues at each switch are large enough ("infinite
queues") then the average switch delay at the first stage is

$$ t\left[\,1 + \frac{m^{2}p\,\bigl(1 - 1/(km)\bigr)}{2(1 - mp)}\right] $$

and the average switch delay at later stages is approximately

$$ t\left[\,1 + \left(1 + \frac{4mp}{5k}\right)\frac{m^{2}p\,(1 - 1/k)}{2(1 - mp)}\right] $$

(see Kruskal et al. [15]). To compute the average network traversal time T
(in one direction), sum the individual stage delays plus the setup time for
the pipe, i.e., (m - 1)t.
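
The traversal time can be evaluated numerically from these expressions. The sketch below rests on several assumptions that should be checked against Kruskal et al. [15]: each stage delay has the form t(1 + queuing term), the first-stage queuing term is m^2 p (1 - 1/(km)) / (2(1 - mp)), later stages multiply m^2 p (1 - 1/k) / (2(1 - mp)) by (1 + 4mp/(5k)), and p is measured in messages per switch cycle per PE. The function names are hypothetical.

```python
def stage_count(n, k):
    """Return log_k n using integer arithmetic; n must be a power of k."""
    if k < 2 or n < k:
        raise ValueError("need k >= 2 and n >= k")
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return stages


def _check_load(m, p):
    # The model is only valid below the capacity of 1/m messages per cycle.
    if p < 0 or m * p >= 1:
        raise ValueError("need 0 <= m*p < 1")


def first_stage_delay(k, m, p, t):
    """Assumed first-stage delay: service time plus queuing delay."""
    _check_load(m, p)
    queuing = m * m * p * (1 - 1 / (k * m)) / (2 * (1 - m * p))
    return t * (1 + queuing)


def later_stage_delay(k, m, p, t):
    """Assumed approximate delay at each stage after the first."""
    _check_load(m, p)
    base = m * m * p * (1 - 1 / k) / (2 * (1 - m * p))
    return t * (1 + (1 + 4 * m * p / (5 * k)) * base)


def transit_time(n, k, m, p, t):
    """One-way traversal time T: stage delays plus pipe setup (m - 1)t."""
    stages = stage_count(n, k)
    return (first_stage_delay(k, m, p, t)
            + (stages - 1) * later_stage_delay(k, m, p, t)
            + (m - 1) * t)


if __name__ == "__main__":
    t_ns = 50.0
    for rate_per_us in (0.0, 2.0, 5.0, 8.0):
        p = rate_per_us * t_ns / 1000.0   # messages per switch cycle
        print(rate_per_us, transit_time(64, 2, 2, p, t_ns))
```

At zero load the model reduces to one cycle per stage plus the pipe setup time, for example 6 x 50 ns + 50 ns for 64 PEs with 2x2 switches and m = 2. An offered load given in messages per microsecond is converted to p by multiplying by the cycle time in microseconds.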

Note that the network has a capacity of 1/m messages per switch cycle
per PE. That is, each PE cannot enter messages at a rate higher than one
per m cycles, and, conversely, the network can accommodate any traffic
below this threshold. Thus, the global bandwidth of the network is
theoretically proportional to the number of PEs connected to it.

The initial term of the expressions for the switch delay corresponds to the
time required for a message to be transmitted through a switch without
being queued (the switch service time). The second term corresponds to
the average queuing delay. This term decreases to zero when the traffic
intensity p decreases to zero and increases to infinity when traffic intensity
p increases to the 1/m threshold. The surprising feature of this formula is
the m^2 factor, which is explained by noting that the queuing delay for a
switch with a multiplexing factor of m is roughly the same as the queuing
delay for a switch with a multiplexing factor of one, a cycle m times
longer, and m times as much traffic per cycle [8].
Figure 3. (a) Network configurations for 64 PEs (switch cycle time 50 ns).
(b) Network configurations for 4096 PEs (switch cycle time 50 ns).


Designing  VLSI  Network  Nodes    227 

We  have  plotted  in  Figure  3  the  graphs  of  T,  the  network  transit  time, 
as  a  function  of  the  messages  per  microsecond  for  different  values  of  k  and 
m with t equal to 50 nanoseconds. Note that the performance advantage
of having fewer stages in the network (k greater than 2) is easily outweighed
by  the  increased  queuing  delay  due  to  more  packets  per  message  (m  large). 
This  is  true  even  when  the  cycle  time  is  the  same;  in  practice,  one  would 
expect  the  more  complex  k  x  k  switches  to  have  a  longer  cycle  time. 

Figure  4  shows  the  effects  of  increasing  and  decreasing  cycle  time.  A 
variety  of  different  switch  structures  can  give  roughly  the  same  performance. 
For  a  fixed  offered  load  coming  from  the  processor,  increasing  the  switch 
cycle  time  has  the  effect  of  increasing  the  traffic  intensity  p  as  seen  by  the 
switch  as  well  as  increasing  the  service  time.  Thus,  decreasing  the  cycle 
time  by  20%  (from  50  to  40  nanoseconds)  can  improve  performance  more 
than going from a 2x2 to a 4x4 network. A comparatively fast cycle time
can  make  a  switch  with  8-packet  time-multiplexing  attractive.  (Such  a  switch 
is  likely  to  be  faster  only  for  a  noncombining  switch,  however,  due  to  the 
difficulties  of  matching  across  packets.)  On  the  other  hand,  the  cycle  time 
can  increase  from  50  to  100  nanoseconds  for  a  4x4  switch  without  losing 
performance  if  m  is  cut  from  4  to  2. 

The above discussion of performance has ignored the cost of constructing
the network. Since the component count for a 64 PE network using 2x2 nodes
is 192, for 4x4 is 48, and for 8x8 is 16, Figure 5 compares networks of
comparable  component  cost,  assuming  50  nanosecond  switch  times  per 
component.  The  transit  time  at  traffic  intensity  p  for  a  message  when  there 
are  d  copies  of  a  given  network  is  the  same  as  for  one  copy  of  that  network 
with  a  traffic  intensity  p/d. 

Limitations  on  offered  load  due  to  processors  waiting  for  outstanding 
memory  requests  make  it  unlikely  that  more  than  one  copy  of  the  network 
will  be  desirable  (except  for  fault  tolerance),  unless  more  than  one  PE  feeds 
into  each  network  port  (see  Norton  and  Pfister  [21]).  Results  of  simulations 
for  a  64-PE  network  with  small  buffer  sizes  at  each  node  (shown  in  Figure 
6)  show  that  an  actual  load  on  the  network  of  much  more  than  one  message 
per  microsecond  is  unlikely  for  processors  with  no  prefetch  capability.  The 
processor's  rate  of  issuing  requests  is  slowed  by  waiting  for  a  response  from 
memory  even  at  the  minimum  network  transit  time.  The  effective  throughput 
is  considerably  below  even  the  slow  memory's  maximum  bandwidth  of  3.3 
messages  per  microsecond.  Extra  copies  of  the  network  provide  little  help 
when  traffic  intensity  is  that  low. 

A  final  determination  of  an  optimal  configuration  requires  more  accurate 
assessments  of  the  technological  constraints  and  the  traffic  distribution.  The 
pipelining  delays  incurred  for  large  multiplexing  factors,  the  complexity  of 
large  switches,  and  the  heretofore  ignored  cost  and  performance  penalty 
incurred  with  interfacing  many  network  copies,  will  probably  make  the  use 
of  switches  larger  than  8x8  impractical  for  a  4K  PE  parallel  machine. 
Figure 4. (a) Effect of changing cycle times (64 PEs). (b) Effect of changing cycle

Figure 5. (a) Networks of the same number of components (64 PEs) (switch cycle
time 50 ns). (b) Networks of the same number of components (4096 PEs) (switch
cycle time 50 ns).


The  previous  discussion  assumed  a  one-chip  implementation  of  each 
switch. By using the two-chip implementation described in the next
subsection one can double the bandwidth of each switch while doubling the chip
count.  As  delays  are  highly  sensitive  to  the  multiplexing  factor  m,  this 
implementation  would  yield  a  better  performance  than  that  obtained  by 
taking  two  copies  of  a  network  built  of  one-chip  switches.  (It  would  also 
have  the  extra  advantage  of  decreasing  the  silicon  area  required  on  each 
chip.) 

We now return to the five assumptions listed above. The first two
assumptions, that all messages traverse the entire network and are of equal
(maximal) length, are clearly conservative. In practice, combined messages
do not each traverse the entire network, and messages that do not carry
data (load requests and store acknowledgements) could be shorter.

Figure 6. (a) Effective throughput versus offered load, variable prefetch (simulation
of a 64 PE network, two-input finite queues, six packet queue size on forward path,
10 on return path, 50 ns switch cycle time, 300 ns memory cycle time). (b) Latency
versus offered load, variable prefetch (simulation of a 64 PE network, two-input
finite queues, six packet queue size on forward path, 10 on return path, 50 ns switch
cycle time, 300 ns memory cycle time).

For  the  last  three  assumptions,  simulations  have  shown  that  queues  of 
modest  size  can  give  essentially  the  same  performance  as  infinite  queues 
(see  Figure  8  and  also  Norton  and  Pfister  [21]).  Interleaved  memory  may 
make  the  patterns  of  access  to  MMs  essentially  uniform.  PE  processes 
cooperating  on  the  same  task  will  certainly  not  be  independent,  but  their 
patterns of memory access, as seen by the network after mediation through
a  cache,  may  not  be  significantly  correlated. 
3.3.   Switch structure

In the current design we have chosen to use time-multiplexing, with each
message divided into one packet containing the path descriptor, address,
and opcode, plus one or more data packets.

The  protocol  used  to  transmit  messages  between  switches  is  a  message- 
level rather than packet-level protocol. That is, packet transmission cannot
be  halted  in  the  middle  of  a  message.  A  switch  will  accept  a  new  message 
only  if  the  available  space  in  its  queues  guarantees  that  it  will  be  able  to 
receive  the  entire  message. 
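
The acceptance rule above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `MessageQueue`, its methods, and the idea of counting free space in whole packets are hypothetical and are not taken from the switch design. The model also ignores the fact that a real queue may drain while a message is arriving.

```python
from collections import deque


class MessageQueue:
    """Bounded packet queue that admits whole messages only (hypothetical model)."""

    def __init__(self, capacity_packets):
        self.capacity = capacity_packets
        self.packets = deque()

    def free_slots(self):
        return self.capacity - len(self.packets)

    def can_accept(self, message_len):
        # Message-level rule: admit only if every packet of the message fits.
        return message_len <= self.free_slots()

    def accept(self, message):
        """Enqueue all packets of `message`, or none of them."""
        if not self.can_accept(len(message)):
            return False
        self.packets.extend(message)
        return True
```

With a six-packet queue, a three-packet message is accepted, a following four-packet message is refused in its entirety, and a second three-packet message then fills the queue.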

The switch is designed to meet the following goals:

•  Distinct  data  paths  do  not  interfere  with  each  other.  Therefore,  a  new 
message  can  be  accepted  at  each  input  port  provided  queues  are  not 
full.  In  addition,  a  message  destined  to  leave  at  some  output  port  will 
not  be  prevented  from  doing  so  by  a  message  routed  to  a  different 
output  port. 

•  A  packet  entering  a  switch  with  empty  queues,  when  no  other  message 
is  destined  for  the  same  output  port,  leaves  the  switch  at  the  next  cycle. 

•  The  capability  to  combine  and  de-combine  memory  requests  does  not 
unduly  slow  the  processing  of  requests  that  are  not  to  be  combined. 

Figure 7 shows a block diagram of a switching node. The "PE port"
connects  to  either  a  PE  or  to  an  MM  port  of  a  preceding  network  stage 
and  the  "MM  port"  connects  to  either  an  MM  or  a  PE  port  of  a  subsequent 
network  stage. 

Associated  with  each  MM  port  is  a  combining  queue  capable  of  accepting 
a  packet  simultaneously  from  each  PE  port.  Requests  that  have  been 
combined  with  other  requests  are  sent  to  a  wait  buffer  at  the  same  time  as 
the  combined  request  is  sent  to  the  MM  port. 

From  each  MM  port,  a  reply  enters  both  the  associated  wait  buffer  and 
the  noncombining  queue  associated  with  the  PE  port  to  which  the  reply  is 
routed.  The  wait  buffer  inspects  all  responses  from  MMs  and  searches  for 
a  response  to  a  previously  combined  request.  When  it  finds  a  response  to 
such  a  request,  it  generates  a  second  response  and  deletes  the  request  from 
its  memory.  Each  noncombining  queue  has  four  inputs  since  messages  may 
come  from  both  MM  ports  and  from  both  wait  buffers. 

If  a  full  2x2  bidirectional  switch  cannot  be  constructed  on  a  single  chip, 
packaging  alternatives  include  dividing  each  switch  into  a  forward  path 
component  (FPC),  consisting  of  the  two  combining  queues,  and  a  return 
path  component  (RPC)  consisting  of  the  wait  buffers  and  noncombining 
queues.  Data  forwarded  to  a  wait  buffer  from  a  combining  queue  are 
transmitted from the FPC to the RPC via two wait buffer output and input
ports on the FPC and RPC, respectively. Furthermore, the wait buffers may
be constructed on chips separate from the noncombining queues. In addition,
pin limitations may require multiplexing the wait buffer data from the
FPC to the RPC over a single port.

* At the expense of a severe increase in complexity, the address can also be transmitted in
more than one packet (Snir and Solworth [26]).

Figure 7. Block diagram of a switching node.

We  do  not  actually  implement  two-input  combining  queues  or  four-input 
noncombining  queues.  Such  designs,  while  possible  (see  Snir  [25]),  involve 
a  considerable  increase  in  complexity.  Instead,  to  achieve  the  goal  of  a 
switch  where  there  is  no  interference  between  distinct  data  paths,  we  include 
a  separate  one-input  queue  on  each  input-output  path,  in  all  four  queues 
for a 2x2 FPC. Simulations indicate (see Figure 8) that performance of
our  scheme  (four  one-input  queues,  two  of  which  are  multiplexed  at  each 
output  port  with  a  selector  that  alternates  when  both  queues  have  data,  but 
always  presents  data  if  only  one  queue  has  data)  is  very  close  to  that  of  a 
scheme with two-input queues and equivalent total queue capacity. Performance
appears to be superior to that of a design with multiplexing of input
data  into  two  one-input  queues  associated  with  each  output  port  (see  Kumar 
and  Jump  [16]).  A  disadvantage  of  our  scheme  is  that  messages  from  distinct 
input  ports  cannot  be  combined.  They  would,  however,  leave  via  the  same 
output port and thus may combine one stage later. On the return path, the
four-input  noncombining  queue  can  be  replaced  with  four  one-input  queues. 

Figure 8. (a) Effective throughput versus queue size (64 PE network, memory cycle
time 100 ns, switch cycle 50 ns, maximum offered load). (b) Latency* versus queue
size (64 PE network, memory cycle time 100 ns, switch cycle 50 ns, maximum offered
load).


Depending  on  the  protocol  used,  the  queues  fed  by  the  wait  buffers  may 
only  require  space  for  one  message. 
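
The output-port selector described in this section (alternate when both one-input queues have data, otherwise present whichever queue has data) can be modelled as follows. This is a minimal sketch, not the circuit itself: the class name `OutputSelector` is hypothetical, arbitration is modelled per queued item rather than per packet, and leaving the turn indicator unchanged when only one queue has data is an assumption.

```python
from collections import deque


class OutputSelector:
    """Multiplexes two one-input queues onto one output port (illustrative model)."""

    def __init__(self):
        self.queues = (deque(), deque())
        self.next_turn = 0  # which queue wins when both have data

    def select(self):
        """Return the next item for the output port, or None if both queues are empty."""
        q0, q1 = self.queues
        if q0 and q1:
            chosen = self.next_turn
            self.next_turn = 1 - chosen  # alternate only when both queues have data
        elif q0:
            chosen = 0
        elif q1:
            chosen = 1
        else:
            return None
        return self.queues[chosen].popleft()
```

Because each input-output path has its own queue, a full queue on one path never blocks traffic on another, which is the non-interference goal stated for the switch.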

Details of the current switch design and a description of an implementation
for a planned 32-PE prototype can be found in Dickey et al. [2], [3]
and Gottlieb [7].

3.4.    Combining  and  its  implementation  cost 

Our  design  for  combining  and  noncombining  queues  is  an  enhancement  of 
the  VLSI  systolic  queue  of  Guibas  and  Liang  [11].  They  present  a  FIFO 
buffer  where  an  insertion  or  deletion  can  be  performed  every  four  cycles, 
and  where  no  global  control  signals  are  used,  other  than  the  clock  signals 
used  by  the  two-phase  logic.  We  use  a  modified  version  of  this  structure, 
where  insertions  and  deletions  can  be  made  at  each  cycle.  To  achieve  this 
we  resort  to  an  increased  number  of  global  control  signals.  In  combining 
queues,  comparators  are  added  to  the  basic  queue  structure  to  detect  requests 
which  are  to  be  combined. 

A combining queue consists of three columns: an IN column, an OUT
column, and a CHUTE column (see Figure 9). Packets added to the queue
enter the IN column and move up the column each cycle until the adjacent
slot in the OUT column is empty. In the absence of combining, they then
move over to the OUT column and begin moving down. Should a packet
reach the end of the IN column without being able to move to the OUT
column, the queue is declared full, and no more messages can be accepted.

Figure 9. Systolic queue design.

* Note latency is measured only through the network and the memory; requests are not
time-stamped and queued when the processor is blocked. It is a measure of the total transit
time at the maximum effective throughput shown above.

The above horseshoe movement pattern ensures that, as a new message
moves  up  the  IN  column,  it  passes  all  messages  in  the  queue  at  the  time 
of  its  arrival.  If  the  address  of  a  message  in  the  IN  column  matches  the 
address  of  a  message  in  the  adjacent  slot  of  the  OUT  column,  the  item  in 
the  IN  column  is  shunted  over  to  the  CHUTE,  where  it  proceeds  down  the 
CHUTE  in  tandem  with  the  corresponding  message  in  the  OUT  column. 
The two messages exit the queue at the same time. Combining logic then
sends a combined request (described below) to the OUT port and appropriate
information to allow de-combining to the wait buffer. Packetizing of
messages and the blocking of the queue somewhat complicate this logic,
since only address packets are to be compared.
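
The pairing behaviour of the combining queue can be illustrated with a behavioural model that ignores the cycle-by-cycle systolic movement and packetization. Everything here is an assumption made for illustration: the names `Slot` and `CombiningQueue`, the dictionary message format with `addr` and `pe` keys, the oldest-first scan order, and the rule that capacity counts OUT-column entries only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    out: dict                     # message in the OUT column
    chute: Optional[dict] = None  # partner riding alongside in the CHUTE column


class CombiningQueue:
    """Behavioural model of two-way combining (not cycle-accurate)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots: List[Slot] = []  # index 0 is nearest the exit

    def insert(self, msg):
        """Add a message; return False if the queue is full."""
        # The new message passes every message already queued, oldest first.
        for slot in self.slots:
            if slot.chute is None and slot.out["addr"] == msg["addr"]:
                slot.chute = msg  # shunt into the CHUTE: two-way combine
                return True
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append(Slot(out=msg))
        return True

    def remove(self):
        """Pop the head as (out_msg, chute_msg_or_None); None if the queue is empty."""
        if not self.slots:
            return None
        head = self.slots.pop(0)
        return head.out, head.chute
```

In this model a third message to an address whose OUT entry already has a CHUTE partner is queued on its own, so at most two requests leave the queue as one combined pair.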

Figure 10. Schematic of a combining queue cell.

For loads, the combined request contains the op-code (load) and the
address (including PE address) of one of the requests; for stores, the same
plus one of the values to be stored.* In each case, the wait buffer must
receive the op-code and the two PE addresses of the combined requests,
plus the memory address or some other unique identifier for the message
that has been combined.** For fetch-and-add operations, the combined
request containing the sum of the two increments is forwarded to the next
stage while the increment from the OUT column is stored in the wait buffer.
Upon decombining, the request that arrived first will receive the original
value of the memory location, while the second request will receive the
original value plus the increment saved from the first.
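
The fetch-and-add case can be checked with a short worked example. The sketch below is illustrative only: the function names, the dictionary fields (`addr`, `pe`, `inc`), and the choice to forward the first request's PE address in the combined request are assumptions, not details of the switch design.

```python
def combine(first, second):
    """Merge two fetch-and-add requests to the same address.

    Returns (combined_request, wait_buffer_entry). `first` is the request
    that arrived earlier (the one in the OUT column).
    """
    assert first["addr"] == second["addr"]
    combined = {"addr": first["addr"], "pe": first["pe"],
                "inc": first["inc"] + second["inc"]}
    wait_entry = {"addr": first["addr"], "first_pe": first["pe"],
                  "second_pe": second["pe"], "saved_inc": first["inc"]}
    return combined, wait_entry


def memory_fetch_and_add(memory, request):
    """Memory module model: return the old value, then add the increment."""
    old = memory[request["addr"]]
    memory[request["addr"]] = old + request["inc"]
    return old


def decombine(reply_value, wait_entry):
    """Split one memory reply into the two replies owed to the original PEs."""
    return {wait_entry["first_pe"]: reply_value,
            wait_entry["second_pe"]: reply_value + wait_entry["saved_inc"]}
```

For a location holding 10, with PE 3 adding 2 and PE 5 adding 4, memory sees a single request with increment 6 and ends at 16; PE 3 receives 10 and PE 5 receives 12, exactly as if the two operations had been performed one after the other.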

A schematic for a single data bit (containing one slice of the IN, OUT,
and CHUTE columns) is shown in Figure 10. FI, HI, FO, and HO are
active during the first clock phase and are computed during the previous
clock phase from global queue full and queue blocked status signals. OTRV,
OTRH, CTRV and CTRH are active during the second clock phase and
computed during the previous clock phase from the empty status of the
OUT and CHUTE slots. The MATCH line is precharged during the first
clock phase and is evaluated during the second clock phase. It is used
during that phase to indicate whether the IN or CHUTE slots will be marked
as occupied. In CMOS technology the additional cost of combining in a
given queue cell amounts to 27 transistors out of a total of 55.

* The serialization principle permits us to discard the other value.

** If the memory address is used, a restriction must be added to the logic of the PE to have
only one combinable request pending to a single memory location.

This  design  allows  only  two-way  combining  of  messages.  If  a  third 
message  to  a  location  arrives  at  a  stage  where  two  requests  to  that  location 
are  queued,  only  two  of  them  will  be  combined.  Recent  work  indicates  that 
two-way  combining  may  not  be  sufficient;  according  to  Lee  et  al.  [19], 
three-way  combining  is  required  to  avoid  saturation  of  the  network  by 
hotspot  requests  for  networks  with  a  large  number  of  stages.  The  above 
design  could  be  modified  for  three-way  combining  by  the  addition  of  another 
CHUTE  column.  This  would  involve  an  increase  in  complexity  of  control 
logic  for  the  combining  queue  as  well  as  for  the  wait  buffer. 

Assuming  the  same  total  amount  of  buffering  in  nodes  on  the  forward 
and  return  paths,  our  current  designs  indicate  that  the  silicon  area  required 
for  an  implementation  of  a  2  x  2  network  node  that  does  combining  will  be 
slightly  less  than  double  that  of  a  noncombining  switch.  This  is  based  on 
assumptions  that  a  combining  queue  will  be  roughly  twice  the  size  of  a 
noncombining  queue  with  equal  message  capacity,  and  that  the  wait  buffer 
will  be  approximately  75%  the  size  of  a  noncombining  queue.  Due  to  the 
overlap  of  the  computation  of  control  information  with  data  transmission, 
we  estimate  an  increase  in  cycle  time  of  only  10  to  20%. 
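
The "slightly less than double" figure can be reproduced with a rough calculation. This is a sketch under simplifying assumptions that are not stated in the text: area is counted per output port as one queue-equivalent of area $q$ on each path, each switch has two wait buffers, and the adder, control logic, and pads are ignored.

```latex
\[
A_{\text{noncombining}} = \underbrace{2q}_{\text{forward queues}} + \underbrace{2q}_{\text{return queues}} = 4q
\]
\[
A_{\text{combining}} = \underbrace{2(2q)}_{\text{combining queues}} + \underbrace{2q}_{\text{return queues}} + \underbrace{2(0.75\,q)}_{\text{wait buffers}} = 7.5\,q
\]
\[
\frac{A_{\text{combining}}}{A_{\text{noncombining}}} = \frac{7.5\,q}{4\,q} \approx 1.9
\]
```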

4.  VLSI design status

In  preparation  for  the  design  of  a  complete  combining  switch  node,  we 
have  designed  several  chips  which  have  been  fabricated  by  DARPA's 
MOSIS  facility. 

We have received functional 11-bit wide 2x2 noncombining forward
path chips containing approximately 7500 transistors and fabricated in
3-micron NMOS. These parts operate at a clock speed of 23 MHz with
propagation delays from clock to output of approximately 25 nanoseconds.
Power dissipation is approximately 1.5 W. A 4 x 4 test network was constructed
using four of these parts and functioned as expected.

We  have  also  had  a  6-bit  wide  portion  of  the  FPC  (without  the  adder) 
for  a  2x2  combining  switch  fabricated  in  4-micron  NMOS.  This  switch  is 
composed of four one-input combining queues. These parts also operate
as  expected  and  have  performance  and  power  dissipation  similar  to  the 
noncombining  switches. 

Since  the  final  combining  switches  must  be  at  least  32-bits  wide  and 
air-cooled,  we  have  converted  our  design  effort  to  MOSIS's  newly  available 
scalable  double-metal  CMOS  process,  which  promises  minimum  feature 
sizes  as  small  as  1.4  microns.  We  have  submitted,  and  are  awaiting  the 
fabrication  of,  a  35-bit  noncombining  forward  path  using  this  CMOS 
technology. 

We are currently completing the design of the remaining components (a
32-bit  adder  and  the  associative  wait  buffer)  and  hope  to  breadboard  a 
complete  (albeit  narrow)  combining  switch  node  later  this  academic  year. 